Abstract:
Convolutional neural networks (CNNs) are at the core of many state-of-the-art deep learning models in computer vision, speech, and text processing. Training and deploying such CNN-based architectures usually require a significant amount of computational resources. Sparsity has emerged as an effective compression approach for reducing the amount of data and computation for CNNs. However, sparsity often results in computational irregularity, which prevents accelerators from fully taking advantage of its benefits for performance and energy improvement. In this paper, we propose CSCNN, an algorithm/hardware co-design framework for CNN compression and acceleration that mitigates the effects of computational irregularity and provides better performance and energy efficiency. On the algorithmic side, CSCNN uses centrosymmetric matrices as convolutional filters. In doing so, it reduces the number of required weights by nearly 50% and enables structured computational reuse without compromising regularity and accuracy. Additionally, complementary pruning techniques are leveraged to further reduce computation by a factor of $2.8-7.2\times $ with a marginal accuracy loss. On the hardware side, we propose a CSCNN accelerator that effectively exploits the structured computational reuse enabled by centrosymmetric filters, and further eliminates zero computations for increased performance and energy efficiency. Compared against a dense accelerator, SCNN and SparTen, the proposed accelerator performs $3.7\times $, $1.6\times $ and $1.3\times $ better, and improves the EDP (Energy Delay Product) by $8.9\times $, $2.8\times $ and $2.0\times $, respectively.
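As a quick illustration of the weight reduction that centrosymmetric filters provide, the sketch below constructs a K×K filter satisfying W[i][j] = W[K-1-i][K-1-j] from its free parameters. This is a minimal example of the constraint the abstract describes, not the paper's training code.

```python
import numpy as np

def centrosymmetric_filter(free_params, k):
    """Build a k x k filter W with W[i, j] == W[k-1-i, k-1-j].

    Only ceil(k*k / 2) of the weights are free; the rest are mirror
    copies, which is where the ~50% weight reduction comes from.
    """
    n = k * k
    half = (n + 1) // 2
    w = np.empty(n)
    w[:half] = free_params
    w[half:] = free_params[:n - half][::-1]  # mirror the first half
    return w.reshape(k, k)

# A 3x3 filter needs only 5 free weights instead of 9.
W = centrosymmetric_filter(np.arange(1.0, 6.0), 3)
assert np.allclose(W, np.rot90(W, 2))  # W equals its own 180-degree rotation
```

Because symmetric weight pairs are equal, a convolution can add the two activations that share a weight before multiplying once, which is presumably the structured computational reuse the accelerator exploits.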
Abstract:
Graph convolutional neural networks (GCNs) have emerged as an effective approach to extend deep learning to graph data analytics. Given that graphs are usually irregular, as nodes in a graph may have a varying number of neighbors, processing GCNs efficiently poses a significant challenge for the underlying hardware. Although specialized GCN accelerators have been proposed to deliver better performance than generic processors, prior accelerators not only under-utilize the compute engine but also impose redundant data accesses that reduce throughput and energy efficiency. Therefore, optimizing the GCN dataflow, i.e., the overall flow of data between compute engines and memory, to maximize utilization and minimize data movement is crucial for efficient GCN processing. In this paper, we propose a flexible and optimized dataflow for GCNs that simultaneously improves resource utilization and reduces data movement. This is realized by fully exploring the design space of GCN dataflows and evaluating the number of execution cycles and DRAM accesses through an analysis framework. Unlike prior GCN dataflows, which employ rigid loop orders and loop fusion strategies, the proposed dataflow can reconfigure the loop order and loop fusion strategy to adapt to different GCN configurations, resulting in much improved efficiency. We then introduce a novel accelerator architecture called GCNAX, which tailors the compute engine, buffer structure, and buffer size to the proposed dataflow. Evaluated on five real-world graph datasets, our simulation results show that GCNAX reduces DRAM accesses by a factor of $8.1 \times$ and $2.4 \times$, while achieving $8.9 \times$, $1.6 \times$ speedup and $9.5 \times$, $2.3 \times$ energy savings on average over HyGCN and AWB-GCN, respectively.
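To see why a reconfigurable loop order matters, the back-of-the-envelope cost model below compares the two ways of chaining the GCN layer computation A·X·W. The dimensions are invented for illustration; this is not the paper's analysis framework.

```python
# One GCN layer computes A_hat @ X @ W; the chaining order changes the work.
# A_hat: (n, n) sparse adjacency with nnz nonzeros; X: (n, f); W: (f, g).
n, f, g, nnz = 10_000, 512, 128, 200_000

# Order 1: (A_hat @ X) @ W -- aggregate over the wide feature matrix first.
cost1 = nnz * f + n * f * g
# Order 2: A_hat @ (X @ W) -- shrink the features from f to g columns first.
cost2 = n * f * g + nnz * g

print(f"(A@X)@W multiplies: {cost1:.2e}")  # 7.58e+08
print(f"A@(X@W) multiplies: {cost2:.2e}")  # 6.81e+08, cheaper since g < f
```

Which order wins depends on the graph's sparsity and the layer's feature widths, which is why a fixed loop order can leave efficiency on the table.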
Abstract:
As machine learning technology continues to advance rapidly, an increasing number of researchers are applying it to malware detection. Although learning-based malware detection systems (LB-MDS) outperform traditional feature-based detection methods in both performance and detection speed, recent research has shown that they are susceptible to attacks from adversarial examples. However, the adversarial examples generated thus far have only been effective against individual LB-MDS and cannot attack multiple LB-MDS simultaneously. In this paper, we propose a black-box adversarial attack framework called Multi-Target Malware Generation (MTMG), which leverages reinforcement learning to attack multiple LB-MDS simultaneously. MTMG selects an obfuscation method and its corresponding parameters from the action space based on the observed state of the malware, and then applies them to generate adversarial examples that deceive multiple LB-MDS. Our results indicate that when simultaneously attacking multiple LB-MDS, including EMBER, MalConv, and six commercial antivirus products, MTMG significantly outperforms state-of-the-art (SOTA) works, achieving an attack success rate of over 82%, while the SOTA works achieve a success rate of less than 6%.
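The abstract implies a standard reinforcement-learning loop: observe the sample's state, pick an obfuscation action, and reward evasions across the whole detector ensemble. The sketch below mocks that loop with an epsilon-greedy policy; the action names, reward, and mutation step are all hypothetical stand-ins, nothing here touches real binaries, and the real MTMG policy and action space are not specified in the abstract.

```python
import random

# Hypothetical semantics-preserving obfuscations; MTMG's real action space
# and per-action parameters are not given in the abstract.
ACTIONS = ["append_overlay", "add_section", "rename_sections", "pack"]

class EpsGreedyPolicy:
    """Minimal epsilon-greedy stand-in for MTMG's learned policy."""
    def __init__(self, eps=0.2):
        self.q = {a: 0.0 for a in ACTIONS}  # running value of each action
        self.n = {a: 0 for a in ACTIONS}
        self.eps = eps

    def choose(self):
        if random.random() < self.eps:
            return random.choice(ACTIONS)   # explore
        return max(self.q, key=self.q.get)  # exploit

    def update(self, action, reward):
        self.n[action] += 1
        self.q[action] += (reward - self.q[action]) / self.n[action]

def attack(sample, detectors, steps=200):
    """Mutate `sample` until every detector in the ensemble is evaded."""
    policy = EpsGreedyPolicy()
    for _ in range(steps):
        action = policy.choose()
        sample = sample + [action]                    # mocked obfuscation
        verdicts = [d(sample) for d in detectors]     # True = flagged
        policy.update(action, verdicts.count(False))  # reward: evasions
        if not any(verdicts):
            return sample                             # fools all LB-MDS
    return None

# Toy ensemble of one detector that is evaded once the sample is packed.
print(attack(["original"], [lambda s: "pack" not in s]))
```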
Abstract:
Current human pose estimation networks are difficult to deploy on lightweight devices because of their large number of parameters. Knowledge distillation is an effective solution, but the student network's learning ability remains insufficient: (1) multi-teacher distillation suffers from an error-avalanche problem; (2) the heatmaps generated by teachers contain noise, which degrades the model; (3) the effect of self-knowledge distillation is ignored; and (4) pose estimation is usually treated purely as a regression problem, overlooking that it is also a classification problem. To address these problems, we propose a densely guided self-knowledge distillation framework named DSKD to solve the error-avalanche problem, introduce a binarization operation to reduce the noise in the teacher's heatmaps, and add a classification loss to the total loss to guide the student's learning. Experimental results show that our method effectively improves the performance of different lightweight models.
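The loss composition described above can be sketched as follows: binarize the teacher heatmaps to suppress their noise, regress the student heatmaps against the result, and add a weighted classification term. The threshold, weight `lam`, and exact loss forms below are assumptions for illustration; the paper's definitions may differ.

```python
import numpy as np

def binarize(heatmap, thresh=0.5):
    """Keep only confident teacher activations to suppress heatmap noise
    (the 0.5 threshold is an assumption, not the paper's value)."""
    return (heatmap >= thresh).astype(heatmap.dtype)

def dskd_loss(student_hm, teacher_hm, student_logits, joint_labels, lam=0.1):
    """Heatmap regression against the binarized teacher plus a weighted
    classification term; `lam` and the loss forms are illustrative."""
    reg = np.mean((student_hm - binarize(teacher_hm)) ** 2)
    # Softmax cross-entropy over per-joint class logits: the added
    # classification view of pose estimation.
    z = student_logits - student_logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    ce = -np.mean(log_probs[np.arange(len(joint_labels)), joint_labels])
    return reg + lam * ce

# Toy shapes: 4 samples, 64x64 heatmaps, 17 joint classes.
loss = dskd_loss(np.random.rand(4, 64, 64), np.random.rand(4, 64, 64),
                 np.random.randn(4, 17), np.random.randint(0, 17, size=4))
```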
Abstract:
Planetary exploration rovers are designed to perform challenging tasks in highly variable, rough terrain. The suspension is an important part of a rover's locomotion system: it distributes loads across the wheels and improves ride smoothness and terrain adaptability. Moreover, a foldable-deployable suspension is valuable for reducing stowed volume and enhancing the carrying reliability of the rover. To design a foldable-deployable suspension with a minimal number of motors for the locomotion system of a six-wheeled rover, we present a dual rocker-bogie scheme that unfolds synchronously in multiple stages, driven by a single motor and clutches. We further propose the deployment manner, mechanical architecture, and actuation program for the locomotion system. After analyzing the influence of the selection of the suspension's structural parameters, the suspension parameters in the deployed configuration are optimized under constraints on the folded-to-deployed ratio and the load-sharing characteristics of the wheels. The obstacle-climbing capability of the locomotion system is examined by establishing a quasi-static model and through simulation. The effect of the centroid pitch transformation of the locomotion system on obstacle-climbing capability and locomotion smoothness is evaluated by simulation.
Abstract:
Convolutional Neural Networks (CNNs) are very computation-intensive. Recently, many CNN accelerators that exploit CNNs' intrinsic parallelism have been proposed. However, we observe a large mismatch between the parallel types supported by the computing engine and the dominant parallel types of CNN workloads. This mismatch seriously degrades the resource utilization of existing accelerators. In this paper, we propose a flexible dataflow architecture (FlexFlow) that leverages the complementary effects among feature-map, neuron, and synapse parallelism to mitigate the mismatch. Evaluated on six typical practical workloads, our design achieves a 2-10x performance speedup and a 2.5-10x power-efficiency improvement over three state-of-the-art accelerator architectures. Meanwhile, FlexFlow scales well with growing computing-engine size.
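For reference, the naive convolution loop nest below marks one plausible reading of the three parallelism types the abstract names; the loop-to-parallelism mapping is an annotation for illustration, not FlexFlow's exact taxonomy.

```python
import numpy as np

def conv_layer(x, w):
    """Naive convolution loop nest; comments mark where each parallelism
    type (feature map, neuron, synapse) could be exploited."""
    C, H, W_in = x.shape          # input channels, height, width
    M, _, K, _ = w.shape          # output maps, (channels), kernel size
    out = np.zeros((M, H - K + 1, W_in - K + 1))
    for m in range(M):            # feature-map parallelism: output maps
        for r in range(H - K + 1):        # neuron parallelism:
            for c in range(W_in - K + 1): #   independent output pixels
                for ch in range(C):       # synapse parallelism: weights
                    for i in range(K):    #   within one neuron's
                        for j in range(K):#   receptive field
                            out[m, r, c] += x[ch, r + i, c + j] * w[m, ch, i, j]
    return out

out = conv_layer(np.random.rand(3, 8, 8), np.random.rand(4, 3, 3, 3))
```

Which of the three loop groups dominates varies across layers and networks, which is the mismatch a fixed-parallelism engine suffers from.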
Abstract:
Sparse matrix-vector multiplication (SpMV) is a core routine in many applications. Its performance is limited by the memory bandwidth required to move the matrix between processors and memory, and by instruction latency in computation. Vectorized (SIMD) operations can dramatically improve execution efficiency, but the sparsity patterns of irregular matrices are a poor fit for SIMD-style execution. We present a new matrix format, Compressed Sparse Column Vector (CSCV), and a corresponding vectorized SpMV algorithm for matrices arising from integral equations. This SpMV algorithm inherently suits wide SIMD instructions and reduces the memory bandwidth used. We implement the algorithm for Computed Tomography (CT) image reconstruction on both Intel and AMD x86 platforms and compare it with seven state-of-the-art SpMV implementations on different CT imaging matrices. Experimental results show that CSCV achieves up to 96.9 GFLOP/s in single-precision tests, a speedup of 3.70× over MKL and 3.48× over the second-best implementation. Furthermore, the CSCV SpMV implementation is performance-portable: it avoids almost all SIMD assembly code and achieves promising performance with compiler-assisted vectorization. Code Availability: https://github.com/sysu-compsci/cscv
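The abstract does not spell out the CSCV layout (the linked repository has the details), so for context the sketch below shows a plain CSC SpMV, whose per-column scatter pattern is the kind of irregularity that defeats SIMD packing and that CSCV reorganizes. The example matrix is made up; this is a reference baseline, not the paper's kernel.

```python
import numpy as np

def csc_spmv(n_rows, col_ptr, row_idx, vals, x):
    """Reference y = A @ x with A in plain Compressed Sparse Column form.

    Each column's nonzeros scatter into arbitrary rows of y, so the inner
    loop is hard to keep packed across SIMD lanes.
    """
    y = np.zeros(n_rows)
    for j in range(len(col_ptr) - 1):             # one column at a time
        xj = x[j]
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[k]] += vals[k] * xj         # irregular scatter into y
    return y

# A = [[1, 0], [2, 3]] in CSC: column 0 holds rows {0, 1}, column 1 holds {1}.
y = csc_spmv(2, [0, 2, 3], [0, 1, 1], [1.0, 2.0, 3.0], [1.0, 1.0])
assert np.allclose(y, [1.0, 5.0])
```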